NurseBridge (NB) wants to create a data driven predictive model allowing rural healthcare systems to make more informed decisions regarding compensation offerings at the most granular unit of time possible.
Goal is to use this data to categorize risk of death by day of the week on a county level.
We will start with monthly data and attempt to work into the day of the week. Due to anonymization of data, days of the week are specified but not the actual dates. I.e, All Fridays of the month are bundled into a single Friday variable.
Underlying cause of death 2018-21
5.Initial imports were performed on Texas and Oklahoma ranging from 2018-21 as these two locations were specifically mentioned by NB as some of their primary areas of focus.
adj_death values are calculated by substracting COVID
deaths from in-hospital deaths. crude_rate_10k values are
calculated by dividing adj_death values by
county_population values and multiplying by 10,000. This
gives a mortality rate per 10,000 people per county.
## Rows: 127,104
## Columns: 9
## $ year <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ wkday <fct> Sunday, Monday, Tuesday, Wednesday, Thursday, Fri…
## $ state <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, O…
## $ county <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair, …
## $ county_code <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001, …
## $ county_population <dbl> 22082, 22082, 22082, 22082, 22082, 22082, 22082, …
## $ wkday_adj_deaths <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ wkday_crude_rate_10k <dbl> 0.4528575, 0.4528575, 0.0000000, 0.4528575, 0.000…
## Rows: 15,888
## Columns: 8
## $ year <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,…
## $ date <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01…
## $ state <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK,…
## $ county <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair…
## $ county_code <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001…
## $ monthly_adj_deaths <dbl> 16, 20, 15, 14, 1, 11, 10, 17, 1, 11, 1, 17, 1,…
## $ monthly_crude_rate_10k <dbl> 7.2457205, 9.0571506, 6.7928630, 6.3400054, 0.4…
## Rows: 1,324
## Columns: 7
## $ year <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ state <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, …
## $ county <fct> Adair, Alfalfa, Atoka, Beaver, Beckham, Blaine, …
## $ county_code <dbl> 40001, 40003, 40005, 40007, 40009, 40011, 40013,…
## $ county_population <dbl> 22082, 5754, 13838, 5319, 21709, 9485, 47192, 28…
## $ annual_adj_deaths <dbl> 156, 39, 88, 38, 145, 79, 322, 258, 707, 349, 31…
## $ annual_crude_rate_10k <dbl> 70.64577, 67.77894, 63.59300, 71.44200, 66.79257…
Early on in EDA it became evident that some replaced
Suppressed values were skewing data in counties with low
populations. For example, Loving, TX had just a population of 57, but
with 5 deaths the crude rate is at a dramatic 877 deaths per 10k
people.
df_annual_5 %>%
select(state, county, county_population, annual_adj_deaths, annual_crude_rate_10k) %>%
arrange(-annual_crude_rate_10k) %>%
slice(1:100)When Suppressed values are replaced with 5, we also see it’s possible
to have negative death rates. For example, if there are 5 months in a
year with Suppressed COVID hospital deaths, they are all converted to
the value 5. The summarised value would then be 25 COVID hospital
deaths. The non-suppressed data shows, however, that the total
in-hospital deaths over that same year is 10. Since our adjusted death
rate subtracts in hospital deaths from COVID deaths we end up with
10 - 25 = -15 in-hospital, non-COVID deaths. Logically,
this doesn’t make sense and we could assume the true suppressed values
are between 1 and 2 for each fo those months.
This histogram below illustrates the issue with using a value of 5 for Suppressed values, resulting in negative value as well as dramatically high rates for low population counties.
If we replace values with 1 instead of 5 we see the distribution follows a more normal curve as would be expected. However, we do see that some rates appear to still fall into the negative ranges. These specific examples are again artifacts of our imperfect replacement with 1.
ggplotly(
df_annual_neg %>%
ggplot(aes(x=annual_crude_rate_10k))+
geom_histogram(bins=50, fill=NA, color="black")+
geom_vline(xintercept = 0, color="red")+
labs(
title="Better distribution of annual crude death rates"
)
)Because we don’t want to remove any counties from representation we
will simply add 3 to the annual_hospital_deaths values so
that the rate remains positive. An ideal solution may have been to
perform a more advanced imputation with a glm, but this gets us in the
right direction.
The rows in question with negative values.
Modifying said rows and displaying the final adjusted calculated variables.
A boxplot shows the outlier clearly.
df_annual %>%
ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
geom_boxplot()+
facet_wrap(~state, scales='free_y')If we filter out such low populations areas we may negatively impact certain counties. There are a few options:
We can see the distributions change subtly for Texas. Also noteworthy is that some adjusted death rates end up negative and may just need reset to a 0 value. \[Adjusted Deaths = In Hospital Deaths - COVID Deaths\] Annual COVID deaths are calculated by summarising monthly COVID deaths. In some cases these are all very low values, and substituting with all 5’s may actually increase the number significantly above the in hospital death count therefore producing a negative number.
g1 <- df_annual %>%
filter(county != "Loving") %>%
ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
geom_boxplot(outlier.colour = NA)+
geom_jitter(alpha=.1, width=.1)+
facet_wrap(~state, scales='free_y')+
labs(
title = "Death rate over time by State",
subtitle = "without Loving, TX"
)+
scale_y_continuous(breaks = seq(0,200,50))
g2 <- df_annual %>%
filter(annual_adj_deaths > 10) %>%
ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
geom_boxplot(outlier.colour = NA)+
geom_jitter(alpha=.1, width=.1)+
facet_wrap(~state, scales='free_y')+
labs(
title = "Death rate over time by State",
subtitle = "Adjusted deaths > 10 only",
)+
scale_y_continuous(breaks = seq(0,200,50))
grid.arrange(g1,g2)Overall rates seem to have dropped off over time.
df_annual %>%
filter(annual_crude_rate_10k < 750) %>%
mutate(annual_crude_rate_10k = ifelse(annual_crude_rate_10k < 1, 0, annual_crude_rate_10k)) %>%
ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
geom_boxplot()+
facet_wrap(~state)df_monthly %>%
group_by(year, month, state) %>%
summarise(crude_mean = mean(monthly_crude_rate_10k)) %>%
mutate(date = ymd(paste0(as.character(year), "-", as.character(month), "-01"))) %>%
ggplot(aes(x=date, y=crude_mean, group=year))+
geom_boxplot(outlier.colour = NA)+
geom_jitter(alpha=.4, width=.04)+
facet_wrap(~state)df_monthly %>%
filter(state == "OK") %>%
filter(monthly_crude_rate_10k < 750) %>%
ggplot(aes(x=date, y=monthly_crude_rate_10k, group=county))+
geom_point()+
geom_line(aes(group=county))+
facet_wrap(~county)